Attorney Docket No. 211163

IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 

In re Application of: 

Alex Gammerman 
Volodya Vovk 

Art Unit: Unassigned 

Application No. 

Examiner: Unassigned 

Filed: 

For: DATA CLASSIFICATION APPARATUS 
AND METHOD THEREOF 

PRELIMINARY AMENDMENT 

Commissioner for Patents 
Washington, D.C. 20231 

Dear Sir: 

Prior to the examination of the above-identified patent application, please enter the 
following amendments and consider the following remarks. 

AMENDMENTS 

IN THE CLAIMS:

Please cancel claims 1-9 without prejudice, and add new claims 10-18 as follows: 



10. Data classification apparatus comprising: 

an input device for receiving a plurality of training classified examples and at least 
one unclassified example; 

a memory for storing said classified and unclassified examples; 

an output terminal for outputting a predicted classification for said at least one 
unclassified example; and 

a processor for identifying the predicted classification of said at least one 
unclassified example 
wherein the processor includes:

classification allocation means for allocating potential classifications to each said 
unclassified example and for generating a plurality of classification sets, each said 
classification set containing said plurality of training classified examples and said at least 
one unclassified example with its said allocated potential classification; 

assay means for determining a strangeness value valid under the iid assumption for 
each said classification set; 

a comparative device for selecting the classification set to which the most likely 
allocated potential classification for said at least one unclassified example belongs, 
wherein said predicted classification output by the output terminal is said most likely 
allocated classification according to said strangeness values assigned by said assay 
means; and 

a strength of prediction monitoring device for determining a confidence value for 
said predicted classification on the basis of said strangeness value assigned by said assay 
means to one of said classification sets to which the second most likely allocated potential 
classification of said at least one unclassified example belongs. 

11. Data classification apparatus as claimed in claim 10, wherein said
processor further includes an example valuation device which determines individual 
strangeness values for each said training classified example and said at least one 
unclassified example having an allocated potential classification. 

12. Data classification apparatus as claimed in claim 11, wherein Lagrange
multipliers are used to determine said individual strangeness values. 

13. Data classification apparatus as claimed in claim 11, wherein said assay
means determines a strangeness value for each said classification set in dependence on 
said individual strangeness values of each said example. 

14. Data classification apparatus comprising: 

an input device for receiving a plurality of training classified examples and at least 
one unclassified example; 

a memory for storing said classified and unclassified examples; 

stored programs including an example classification program; 

an output terminal for outputting a predicted classification for said at least one 
unclassified example; and 

a processor controlled by said stored programs for identifying the predicted 
classification of said at least one unclassified example, 
wherein said processor includes: 

classification allocation means for allocating potential classifications to each said 
unclassified example and for generating a plurality of classification sets, each said 
classification set containing said plurality of training classified examples and said at least 
one unclassified example with its allocated potential classification; 

assay means for determining a strangeness value valid under the iid assumption for 
each said classification set; 

a comparative device for selecting the classification set to which the most likely
allocated potential classification for said at least one unclassified example belongs, 
wherein the predicted classification output by said output terminal is the most likely 
allocated potential classification according to said strangeness values assigned by said 
assay means; and

a strength of prediction monitoring device for determining a confidence value for 
said predicted classification on the basis of said strangeness value assigned by said assay 
means to one of said classification sets to which the second most likely allocated potential 
classification of said at least one unclassified example belongs. 

15. A data classification method comprising: 

inputting a plurality of training classified examples and at least one unclassified 
example; 

identifying a predicted classification of said at least one unclassified example 
which includes, 

allocating potential classifications to each said unclassified example; 

generating a plurality of classification sets, each said classification set containing 
said plurality of training classified examples and said at least one unclassified example 
with its allocated potential classification; 

determining a strangeness value valid under the iid assumption for each said 
classification set; 

selecting the said classification set to which the most likely allocated potential
classification for said at least one unclassified example belongs, wherein said predicted 
classification is the most likely allocated potential classification in dependence on said 
strangeness values; 

determining a confidence value for said predicted classification on the basis of the 
strangeness value assigned to one of said classification sets to which the second most 
likely allocated potential classification for said at least one unclassified example belongs; 
and 

outputting said predicted classification for said at least one unclassified example 
and said confidence value for said predicted classification. 

16. A data classification method as claimed in claim 15, further including 
determining individual strangeness values for each said training classified example and 
said at least one unclassified example having an allocated potential classification. 

17. A data classification method as claimed in claim 15, wherein said selected 
classification set is selected without the application of any general rules determined from 
the said training set. 

18. A data carrier on which is stored a classification program for classifying
data by performing the following steps: 

generating a plurality of classification sets, each said classification set containing a
plurality of training classified examples and at least one unclassified example that has 
been allocated a potential classification; 

determining a strangeness value valid under the iid assumption for each said 
classification set; 

selecting the classification set to which the most likely allocated potential 
classification for the said at least one unclassified example belongs, wherein the predicted 
classification is the most likely allocated potential classification in dependence on said 
strangeness values; and 

determining a confidence value for said predicted classification on the basis of 
said strangeness value assigned to one of said classification sets to which the second most 
likely allocated potential classification for said at least one unclassified example belongs. 



Respectfully submitted, 




Dennis K. Schlemmer, Reg. No. 24,703 
One of the Attorneys for Applicant(s)
LEYDIG, VOIT & MAYER, LTD. 
Two Prudential Plaza, Suite 4900 
180 North Stetson 
Chicago, Illinois 60601-6780 
(312) 616-5600 (telephone) 
(312) 616-5700 (facsimile)



Date: May 8, 2001 






WO 00/28473 



09/831262 
Rec'd PCT/PTO 08 MAY 2001

PCT/GB99/03737 



DATA CLASSIFICATION APPARATUS AND METHOD THEREOF 
BACKGROUND OF THE INVENTION 

The present invention relates to data classification apparatus and an
automated method of data classification thereof that provides a universal
measure of confidence in the predicted classification for any unknown
input. Especially, but not exclusively, the present invention is suitable for
pattern recognition, e.g. optical character recognition.

In order to automate data classification such as pattern recognition,
the apparatus, usually in the form of a computer, must be capable of
learning from known examples and extrapolating to predict a classification
for new unknown examples. Various techniques have been developed
over the years to enable computers to perform this function including, inter
alia, discriminant analysis, neural networks, genetic algorithms and support
vector machines. These techniques usually originate in two fields: machine
learning and statistics.

Learning machines developed in the theory of machine learning
often perform very well in a wide range of applications without requiring any
parametric statistical assumptions about the source of data (unlike
traditional statistical techniques); the only assumption made is the iid
assumption (the examples are generated from the same probability
distribution independently of each other). A new approach to machine
learning is described in US5640492, where mathematical optimisation
techniques are used for classifying new examples. The advantage of the
learning machine described in US5640492 is that it can be used for solving
extremely high-dimensional problems which are infeasible for the
previously known learning machines.

A typical drawback of such techniques is that they do not
provide any measure of confidence in the predicted classification output by
the apparatus. A typical user of such data classification apparatus just
hopes that the accuracy of the results from previous analyses using
benchmark datasets is representative of the results to be obtained from the
analysis of future datasets.

Other options for the user who wants to associate a measure of
confidence with new unclassified examples include performing experiments
on a validation set, using one of the known cross-validation procedures,
and applying one of the theoretical results about the future performance of
different learning machines given their past performance. None of these
confidence estimation procedures, though, provides any practicable means
for assessing the confidence of the predicted classification for an individual
new example. Known confidence estimation procedures that address the
problem of assessing the confidence of a predicted classification for an
individual new example are ad hoc and do not admit interpretation in
rigorous terms of mathematical probability theory.

Confidence estimation is a well-studied area of both parametric and
non-parametric statistics. In some parts of statistics the goal is
classification of future examples rather than of parameters of the model,
which is relevant to the need addressed by this invention. In statistics,
however, only confidence estimation procedures suitable for
low-dimensional problems have been developed. Hence, to date,
mathematically rigorous confidence assessment has not been employed in
high-dimensional data classification.

SUMMARY OF THE INVENTION 
The present invention provides a new data classification apparatus
and method that can cope with high-dimensional classification problems
and that provides a universal measure of confidence, valid under the iid
assumption, for each individual classification prediction made by the new
data classification apparatus and method.

The present invention provides data classification apparatus
comprising: an input device for receiving a plurality of training classified
examples and at least one unclassified example; a memory for storing the
classified and unclassified examples; an output terminal for outputting a
predicted classification for the at least one unclassified example; and a
processor for identifying the predicted classification of the at least one
unclassified example wherein the processor includes: classification
allocation means for allocating potential classifications to each unclassified
example and for generating a plurality of classification sets, each
classification set containing the plurality of training classified examples and
the at least one unclassified example with its allocated potential
classification; assay means for determining a strangeness value for each
classification set; and a comparative device for selecting a classification set
containing the most likely allocated potential classification for the at least
one unclassified example, whereby the predicted classification output by the
output terminal is the most likely allocated potential classification, according
to the strangeness values assigned by the assay means.

In the preferred embodiment the processor further includes a
strength of prediction monitoring device for determining a confidence value
for the predicted classification on the basis of the strangeness value of a
set containing the at least one unclassified example with the second most
likely allocated potential classification.

With the present invention the conventional data classification
technique of induction learning and then deduction for new unknown data
vectors is supplanted by a new transduction technique that avoids the need
to identify any all-encompassing general rule. Thus, with the present
invention no multidimensional hyperplane or boundary is identified. The
training data vectors are used directly to provide a predicted classification
for unknown data vectors; in other words, the training data vectors
implicitly drive classification prediction for an unknown data vector.

It is important to note that with the present invention the measure of
confidence is valid under the general iid assumption and the present
invention is able to provide measures of confidence for even very
high-dimensional problems.

Furthermore, with the present invention more than one unknown
data vector can be classified and a measure of confidence generated
simultaneously.

In a further aspect the present invention provides data classification
apparatus comprising: an input device for receiving a plurality of training
classified examples and at least one unclassified example; a memory for
storing the classified and unclassified examples; stored programs including
an example classification program; an output terminal for outputting a
predicted classification for the at least one unclassified example; and a
processor controlled by the stored programs for identifying the predicted
classification of the at least one unclassified example wherein the
processor includes: classification allocation means for allocating potential
classifications to each unclassified example and for generating a plurality of
classification sets, each classification set containing the plurality of training
classified examples and the at least one unclassified example with its
allocated potential classification; assay means for determining a
strangeness value for each classification set; and a comparative device for
selecting a classification set containing the most likely allocated potential
classification for the at least one unclassified example, whereby the
predicted classification output by the output terminal is the most likely
allocated potential classification, according to the strangeness values
assigned by the assay means.

In a third aspect the present invention provides a data classification
method comprising:

inputting a plurality of training classified examples and at least one
unclassified example;

identifying a predicted classification of the at least one unclassified
example which includes

    allocating potential classifications to each unclassified example;

    generating a plurality of classification sets each containing the
    plurality of training classified examples and the at least one
    unclassified example with an allocated potential classification;

    determining a strangeness value for each classification set; and

    selecting, according to the assigned strangeness values, a
    classification set containing the most likely allocated potential
    classification;

and outputting the predicted classification for the at least one unclassified
example whereby the predicted classification output by an output terminal
is the most likely allocated potential classification.

It will, of course, be appreciated that the above method and
apparatus may be implemented in a data carrier on which is stored a
classification program.

BRIEF DESCRIPTION OF THE DRAWINGS 
An embodiment of the present invention will now be described by
way of example only with reference to the accompanying drawings, in
which:

Figure 1 is a schematic diagram of data classification apparatus in
accordance with the present invention;

Figure 2 is a schematic diagram of the operation of data
classification apparatus of Figure 1;

Figure 3 is a table showing a set of training examples and
unclassified examples for use with a data classifier in accordance with the
present invention; and

Figure 4 is a tabulation of experimental results where a data
classifier in accordance with the present invention was used in character
recognition.

DESCRIPTION OF PREFERRED EMBODIMENT 

In Figure 1 a data classifier 10 is shown generally consisting of an
input device 11, a processor 12, a memory 13, a ROM 14 containing a
suite of programs accessible by the processor 12 and an output terminal
15. The input device 11 preferably includes a user interface 16 such as a
keyboard or other conventional means for communicating with and
inputting data to the processor 12 and the output terminal 15 may be in the
form of a display monitor or other conventional means for displaying
information to a user. The output terminal 15 preferably includes one or
more output ports for connection to a printer or other network device. The
data classifier 10 may be embodied in an Application Specific Integrated
Circuit (ASIC) with additional RAM chips. Ideally, the ASIC would contain a
fast RISC CPU with an appropriate Floating Point Unit.

To assist in an understanding of the operation of the data classifier
10 in providing a prediction of a classification for unclassified (unknown)
examples, the following is an explanation of the mathematical theory
underlying its operation.

Two sets of examples (data vectors) are given: a training set
consisting of examples with their classifications (or classes) known, and a
test set consisting of unclassified examples. In Figure 3, a training set of five
examples and two test examples are shown, where the unclassified
examples are images of digits and the classification is either 1 or 7.

The notation for the size of the training set is $l$ and, for simplicity, it is
assumed that the test set of examples contains only one unclassified
example. Let $(X, \mathcal{A})$ be the measurable space of all possible
unclassified examples (in the case of Figure 3, $X$ might be the set of all
16 x 16 grey-scale images) and $(Y, \mathcal{B})$ be the measurable space of
classes (in the case of Figure 3, $Y$ might be the 2-element set $\{1, 7\}$).
$Y$ is typically finite.

The confidence prediction procedure is a family
$\{f_\beta : \beta \in (0, 1]\}$ of measurable mappings
$f_\beta : (X \times Y)^l \times X \to \mathcal{B}$ such that:

1. For any confidence level $\beta$ (in data classification typically we are
interested in $\beta$ close to 1) and any probability distribution $P$ in
$X \times Y$, the probability that

$$y_{l+1} \in f_\beta(x_1, y_1, \ldots, x_l, y_l, x_{l+1})$$

is at least $\beta$, where $(x_1, y_1), \ldots, (x_l, y_l), (x_{l+1}, y_{l+1})$
are generated independently from $P$.

2. If $\beta_1 < \beta_2$, then, for all
$(x_1, y_1, \ldots, x_l, y_l, x_{l+1}) \in (X \times Y)^l \times X$,

$$f_{\beta_1}(x_1, y_1, \ldots, x_l, y_l, x_{l+1}) \subseteq f_{\beta_2}(x_1, y_1, \ldots, x_l, y_l, x_{l+1}).$$

The assertion implicit in the prediction
$f_\beta(x_1, y_1, \ldots, x_l, y_l, x_{l+1})$ is that the true label $y_{l+1}$
will belong to $f_\beta(x_1, y_1, \ldots, x_l, y_l, x_{l+1})$. Item 1 requires
that the prediction given by $f_\beta$ should be correct with probability at
least $\beta$, and item 2 requires that the family $\{f_\beta\}$ should be
consistent: if some label $y$ for the $(l+1)$th example is allowed at
confidence level $\beta_1$, it should also be allowed at any confidence level
$\beta_2 > \beta_1$.

A typical mode of use of this definition is that some conventional
value of $\beta$, such as 95% or 99%, is chosen in advance, after which the
function $f_\beta$ is used for prediction. Ideally, the prediction region output
by $f_\beta$ will contain only one classification.

An important feature of the data classification apparatus is defining
$f_\beta$ in terms of solutions $\alpha_i$, $i = 1, \ldots, l+1$, to auxiliary
optimisation problems of the kind outlined in US5640492, the contents of
which are incorporated herein by reference. Specifically, we consider $|Y|$
completions of our data

$$(x_1, y_1), \ldots, (x_l, y_l), x_{l+1};$$

the completion $y$, $y \in Y$, is

$$(x_1, y_1), \ldots, (x_l, y_l), (x_{l+1}, y)$$

(so in all completions every example is classified). With every completion

$$(x_1, y_1), \ldots, (x_l, y_l), (x_{l+1}, y_{l+1})$$

(for notational convenience we write $y_{l+1}$ in place of $y$ here) is
associated the optimisation problem

$$\frac{1}{2}(w \cdot w) + C \sum_{i=1}^{l+1} \xi_i \to \min \qquad (1)$$

(where $C$ is a fixed positive constant) subject to the constraints

$$y_i((x_i \cdot w) + b) \ge 1 - \xi_i, \quad i = 1, \ldots, l+1. \qquad (2)$$

This problem involves non-negative variables $\xi_i \ge 0$, which are called
slack variables. If the constant $C$ is chosen too large, the accuracy of
solution can become unacceptably poor; $C$ should be chosen as large as
possible in the range in which the numerical accuracy of solution remains
reasonable. (When the data is linearly separable, it is even possible to set
$C$ to infinity, but since it is rarely if ever possible to tell in advance that all
completions will be linearly separable, $C$ should be taken large but finite.)

The optimisation problem is transformed, via the introduction of
Lagrange multipliers $\alpha_i$, $i = 1, \ldots, l+1$, to the dual problem: find
$\alpha_i$ from

$$\sum_{i=1}^{l+1} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l+1} y_i y_j \alpha_i \alpha_j (x_i \cdot x_j) \to \max \qquad (3)$$

under the "box" constraints

$$0 \le \alpha_i \le C, \quad i = 1, 2, \ldots, l+1. \qquad (4)$$

The unclassified examples are represented, it is assumed, as the values
taken by $n$ numerical attributes and so $X = \mathbb{R}^n$.

This quadratic optimisation problem is applied not to the attribute
vectors $x_i$ themselves, but to their images $V(x_i)$ under some
predetermined function $V : X \to H$ taking values in a Hilbert space, which
leads to replacing the dot product $x_i \cdot x_j$ in the optimisation problem
(3)-(4) by the kernel function

$$K(x_i, x_j) = V(x_i) \cdot V(x_j).$$

The final optimisation problem is, therefore,

$$\sum_{i=1}^{l+1} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l+1} y_i y_j \alpha_i \alpha_j K(x_i, x_j) \to \max$$

under the "box" constraints

$$0 \le \alpha_i \le C, \quad i = 1, 2, \ldots, l+1;$$

this quadratic optimisation problem can be solved using standard
packages.
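
As a rough illustration of how a box-constrained dual of this kind might be solved without a dedicated package, the following Python sketch performs coordinate ascent on the kernelised objective, clipping each multiplier to the interval [0, C] after every update. The function name `solve_box_dual`, the precomputed kernel matrix `K`, the labels coded as +1 and -1 and the fixed number of sweeps are all assumptions made for this example; the description itself says only that standard packages can be used.

```python
def solve_box_dual(K, y, C, sweeps=200):
    """Maximise sum(a) - 0.5 * sum_ij y_i y_j a_i a_j K[i][j] with 0 <= a_i <= C.

    K is a symmetric positive semi-definite kernel matrix (list of lists) and
    y holds labels coded as +1 / -1. Illustrative coordinate ascent only.
    """
    n = len(y)
    a = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            if K[i][i] <= 0.0:
                continue  # no curvature in a_i; left untouched in this sketch
            # Partial derivative of the objective with respect to a_i.
            grad = 1.0 - y[i] * sum(y[j] * a[j] * K[i][j] for j in range(n))
            # Exact one-dimensional maximiser, then projection onto the box.
            a[i] = min(C, max(0.0, a[i] + grad / K[i][i]))
    return a


# Two orthogonal examples with opposite labels: K is the identity matrix.
K = [[1.0, 0.0], [0.0, 1.0]]
print(solve_box_dual(K, [1, -1], C=10.0))  # optimum lies inside the box
print(solve_box_dual(K, [1, -1], C=0.5))   # optimum is clipped at C
```

In the toy case each multiplier would reach 1 if unconstrained, so a small C visibly caps it; this mirrors the role of the "box" constraints in the text.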

The Lagrange multiplier $\alpha_i$, $i \in \{1, \ldots, l+1\}$, reflects the
"strangeness" of the example $(x_i, y_i)$; we expect that $\alpha_{l+1}$ will
be large in the wrong completions.

For $y \in Y$, define

$$d(y) := \frac{|\{i : \alpha_i \ge \alpha_{l+1}\}|}{l+1};$$

therefore $d(y)$ is the p-value associated with the completion $y$ ($y$ being
an alternative notation for $y_{l+1}$). The confidence prediction function $f$,
which is at the core of this invention, can be expressed as

$$f_\beta(x_1, y_1, \ldots, x_l, y_l, x_{l+1}) := \{y : d(y) > 1 - \beta\}.$$

The most interesting case is where the prediction set given by $f_\beta$ is a
singleton; therefore, the most important features of the confidence
prediction procedure $\{f_\beta\}$ at the data
$(x_1, y_1), \ldots, (x_l, y_l), x_{l+1}$ are:

• the largest $\beta = \beta_0$ for which
$f_\beta((x_1, y_1), \ldots, (x_l, y_l), x_{l+1})$ is a singleton (assuming such
a $\beta$ exists);

• the classification $F((x_1, y_1), \ldots, (x_l, y_l), x_{l+1})$ defined to be
that $y \in Y$ for which $f_{\beta_0}((x_1, y_1), \ldots, (x_l, y_l), x_{l+1})$ is
$\{y\}$.

$F((x_1, y_1), \ldots, (x_l, y_l), x_{l+1})$ defined in this way is called the
$f$-optimal prediction algorithm; the corresponding $\beta_0$ is called the
confidence level associated with $F$.

Another important feature of the confidence estimation function
$\{f_\beta\}$ at the data $(x_1, y_1), \ldots, (x_l, y_l), x_{l+1}$ is the largest
$\beta = \beta_*$ for which $f_\beta((x_1, y_1), \ldots, (x_l, y_l), x_{l+1})$ is
the empty set. We call $1 - \beta_*$ the credibility of the data set
$(x_1, y_1), \ldots, (x_l, y_l), x_{l+1}$; it is the p-value of a test for checking
the iid assumption. Where the credibility is very small, either the training set
$(x_1, y_1), \ldots, (x_l, y_l)$ or the new unclassified example $x_{l+1}$ are
untypical, which renders the prediction unreliable unless the confidence level
is much closer to 1 than is $1 - \beta_*$. In general, the sum of the confidence
and credibility is between 1 and 2; the success of the prediction is measured
by how close this sum is to 2.
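
The p-value and the prediction set defined in this part of the description can be sketched in a few lines of Python. The sketch assumes that ties are counted, i.e. that d(y) is the fraction of examples whose strangeness is at least that of the new example, and that the strangeness values of each completion are already available as a list with the new example last. The function names and the numerical strangeness values are hypothetical and serve only as illustration.

```python
def p_value(alphas):
    """d(y): fraction of examples whose strangeness is >= that of the new example.

    `alphas` lists the strangeness values of one completion, with the new
    (provisionally classified) example last.
    """
    newest = alphas[-1]
    return sum(1 for a in alphas if a >= newest) / len(alphas)


def prediction_region(p_values, beta):
    """f_beta: every candidate classification y with d(y) > 1 - beta."""
    return {y for y, d in p_values.items() if d > 1.0 - beta}


# Hypothetical strangeness values for the two completions of a {1, 7} problem.
p_values = {
    1: p_value([0.5, 0.2, 0.0, 0.0]),  # new example is not strange at all
    7: p_value([0.0, 0.2, 0.0, 0.9]),  # new example is the strangest of all
}
print(p_values)
print(prediction_region(p_values, beta=0.70))  # a single classification
print(prediction_region(p_values, beta=0.80))  # both classifications
```

Raising the confidence level can only enlarge the prediction set, which is the consistency property required of the family of mappings.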



With the data classifier of the present invention operated as
described above, the following menus or choices may be offered to a user:

1. Prediction and Confidence

2. Credibility

3. Details.

A typical response to the user's selection of choice 1 might be
prediction: 4, confidence: 99%, which means that 4 will be the prediction
output by the $f$-optimal $F$ and 99% is the confidence level of this prediction.
A typical response to choice 2 might be credibility: 100%, which gives the
computed value of credibility. A typical response to choice 3 might be:

0     1     2     3     4     5     6     7     8     9
0.1%  1%    0.2%  0.4%  100%  1.1%  0.6%  0.2%  1%    1%

the complete set of p-values for all possible completions. The latter choice
contains the information about $F((x_1, y_1), \ldots, (x_l, y_l), x_{l+1})$ (the
character corresponding to the largest p-value), the confidence level (one
minus the second largest p-value) and the credibility (the largest p-value).
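
The arithmetic behind the three menu choices can be checked with a short Python sketch that takes the ten p-values quoted for choice 3 and derives the prediction, confidence and credibility from them. The function name `summarise` and the dictionary layout are assumptions made for this illustration.

```python
def summarise(p_values):
    """Return (prediction, confidence, credibility) from per-class p-values."""
    ranked = sorted(p_values.items(), key=lambda item: item[1], reverse=True)
    (best_label, best_p), (_, second_p) = ranked[0], ranked[1]
    # Confidence is one minus the second largest p-value;
    # credibility is the largest p-value.
    return best_label, 1.0 - second_p, best_p


# p-values from the "Details" response quoted in the text, as fractions.
details = {0: 0.001, 1: 0.01, 2: 0.002, 3: 0.004, 4: 1.0,
           5: 0.011, 6: 0.006, 7: 0.002, 8: 0.01, 9: 0.01}
prediction, confidence, credibility = summarise(details)
print(prediction, round(confidence, 3), credibility)
```

With these figures the prediction is 4, the credibility is 100% and the confidence comes out at about 98.9%, which is in line with the 99% shown for choice 1 once rounded.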

This mode of using the confidence prediction function $f$ is not the
only possible mode: in principle it can be combined with any prediction
algorithm. If $G$ is a prediction algorithm, with its prediction
$\hat{y} := G((x_1, y_1), \ldots, (x_l, y_l), x_{l+1})$ we can associate the
following measure of confidence:

$$1 - \max_{y \ne \hat{y}} d(y).$$

The prediction algorithm $F$ described above is the one that optimises this
measure of confidence.
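
A minimal Python sketch of this idea follows, reading the measure of confidence as one minus the largest p-value among the completions that disagree with the prediction of G (the rule the description gives later for an alternative prediction algorithm). The function name `confidence_for` and the p-values used are hypothetical, and at least two candidate classifications are assumed.

```python
def confidence_for(prediction, p_values):
    """Confidence attached to any prediction: 1 - largest p-value of its rivals."""
    rivals = [d for label, d in p_values.items() if label != prediction]
    return 1.0 - max(rivals)


# Hypothetical p-values for a two-class problem.
p_values = {"1": 0.9, "7": 0.05}
print(confidence_for("1", p_values))  # prediction with the largest p-value
print(confidence_for("7", p_values))  # any other prediction scores lower
```

Predicting the classification with the largest p-value leaves the smallest rival p-value to subtract, which is why that choice maximises the measure.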

The table shown in Figure 4 contains the results of an experiment in
character recognition using the data classifier of the present invention. The
table shows the results for a test set of size 10, using a training set of size
20 (not shown). The kernel used was $K(x, y) = (x \cdot y)^3 / 256$.
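
Reading the kernel as the cube of the dot product divided by 256, it can be written as a one-line Python function. The name `cubic_kernel` and the representation of examples as plain sequences of numbers are assumptions made for this sketch.

```python
def cubic_kernel(x, y):
    """Polynomial kernel of degree 3, scaled by 1/256."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot ** 3 / 256


print(cubic_kernel([2, 0], [4, 0]))  # dot product 8, so 512 / 256
```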

It is contemplated that some modifications of the optimisation
problem set out under equations (1) and (2) might have certain advantages,
for example,

$$\frac{1}{2}(w \cdot w) + C \sum_{i=1}^{l+1} \xi_i^2 \to \min,$$

subject to the constraints

$$y_i((x_i \cdot w) + b) = 1 - \xi_i, \quad i = 1, \ldots, l+1.$$

It is further contemplated that the data classifier described above
may be particularly useful for predicting the classification of more than one
example simultaneously; the test statistic used for computing the p-values
corresponding to different completions might be the sum of the ranks of the
$\alpha_i$ corresponding to the new examples (as in the Wilcoxon rank-sum test).

In practice, as shown in Figure 2, a training dataset is input 20 to the
data classifier. The training dataset consists of a plurality of data vectors
each of which has an associated known classification allocated from a set
of classifications. For example, in numerical character recognition, the set
of classifications might be the numerical series 0 to 9. The set of
classifications may separately be input 21 to the data classifier or may be
stored in the ROM 14. In addition, some constructive representation of the
measurable space of the data vectors may be input 22 to the data classifier
or again may be stored in the ROM 14. For example, in the case of
numerical character recognition the measurable space might consist of
16x16 pixellated grey-scale images. Where the measurable space is
already stored in the ROM 14 of the data classifier, the interface 16 may
include input means (not shown) to enable a user to input adjustments for
the stored measurable space. For example, greater definition of an image
may be required in which case the pixellation of the measurable space
could be increased.

One or more data vectors for which no classification is known are
also input 23 into the data classifier. The training dataset and the
unclassified data vectors along with any additional information input by the
user are then fed from the input device 11 to the processor 12.

Firstly, each one of the one or more unclassified data vectors is
provisionally individually allocated 24 a classification from the set of
classifications. An individual strangeness value $\alpha_i$ is then determined 25
for each of the data vectors in the training set and for each of the
unclassified data vectors for which a provisional classification allocation
has been made. A classification set is thus generated containing each of
the data vectors in the training set and the one or more unclassified data
vectors with their allocated provisional classifications and the individual
strangeness values $\alpha_i$ for each data vector. A plurality of such
classification sets is then generated with the allocated provisional
classifications of the unclassified data vectors being different for each
classification set.

Computation of a single strangeness value, the p-value, for each
classification set containing the complete set of training data vectors and
unclassified vectors with their current allocated classification is then
performed 26, on the basis of the individual strangeness values $\alpha_i$
determined in the previous step. This p-value and the associated set of
classifications is transferred to the memory 13 for future comparison whilst
each of the one or more unclassified data vectors is provisionally
individually allocated with the same or a different classification. The steps
of calculating individual strangeness values 25 and the determination of a
p-value 26 are repeated in each iteration for the complete set of training
data vectors and the unclassified data vectors, using different classification
allocations for the unclassified data vectors each time. This results in a
series of p-values being stored in the memory 13 each representing the
strangeness of the complete set of data vectors with respect to unique
classification allocations for the one or more unclassified data vectors.

The p-values stored in the memory are then compared 27 to identify
the maximum p-value and the next largest p-value. Finally, the
classification set of data vectors having the maximum p-value is supplied
28 to the output terminal 15. The data supplied to the output terminal may
consist solely of the classification(s) allocated to the unclassified data
vector(s), which now represents the predicted classification, from the
classification set of data vectors having the maximum p-value.

Furthermore, a confidence value for the predicted classification is
generated 29. The confidence value is determined based on the
subtraction of the next largest p-value from 1. Hence, if the next largest
p-value is large, the confidence of the predicted classification is small and if
the next largest p-value is small, the confidence value is large. Choice 1,
referred to earlier, provides a user with predicted classifications for the one
or more unknown data vectors and the confidence value.
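
The comparison 27 and the confidence computation 29 can be sketched together as below. This is a minimal sketch, assuming the stored p-values are held in a dictionary keyed by the allocated classification; the function name and the example values are hypothetical.

```python
def predict_with_confidence(p_values):
    """Return (predicted classification, confidence value).

    p_values maps each provisionally allocated classification to its
    p-value. The prediction is the allocation with the maximum p-value;
    the confidence is 1 minus the next largest p-value.
    """
    if len(p_values) < 2:
        raise ValueError("at least two candidate classifications are needed")
    ranked = sorted(p_values.items(), key=lambda item: item[1], reverse=True)
    best_label, _ = ranked[0]
    _, next_largest_p = ranked[1]
    return best_label, 1.0 - next_largest_p

# Illustrative values only, not taken from the specification.
label, confidence = predict_with_confidence({"A": 0.8, "B": 0.2, "C": 0.05})
```

With these illustrative values the prediction is "A" and the confidence is 1 minus 0.2.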

Where an alternative prediction algorithm is to be used, the
confidence value will be computed by subtracting from 1 the largest p-value
for the sets of training data vectors and new vectors classified differently
from the predicted (by the alternative method) classification.
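
A short sketch of this variant follows, again assuming the p-values are held in a dictionary keyed by the allocated classification and that the alternative algorithm's prediction is passed in as an argument. The function name and the example values are hypothetical.

```python
def confidence_for_external_prediction(p_values, predicted_label):
    """Confidence for a prediction made by some other algorithm:
    1 minus the largest p-value among the classification sets whose
    allocation differs from the predicted classification."""
    others = [p for label, p in p_values.items() if label != predicted_label]
    if not others:
        raise ValueError("no differently classified set to compare against")
    return 1.0 - max(others)

# Illustrative values only, not taken from the specification.
p_values = {"A": 0.8, "B": 0.2, "C": 0.05}
low = confidence_for_external_prediction(p_values, "B")
high = confidence_for_external_prediction(p_values, "A")
```

If the alternative algorithm predicts "B" while the set allocated "A" has the largest p-value, the resulting confidence is low, which signals that the prediction is poorly supported by the data.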

Additional information in the form of the p-values for each of the sets 
of data vectors with respect to the individual allocated classifications may 
also be supplied (choice 3) or simply the p-value for the predicted 
classification (choice 2). 

With the data classifier and method of data classification described
above, a universal measure of the confidence in any predicted
classification of one or more unknown data vectors is provided. Moreover,
at no point is a general rule or multidimensional hyperplane extracted from
the training set of data vectors. Instead, the data vectors are used directly
to calculate the strangeness of a provisionally allocated classification(s) for
one or more unknown data vectors.

While the data classification apparatus and method have been
particularly shown and described with reference to the above preferred
embodiment, it will be understood by those skilled in the art that various
modifications in form and detail may be made therein without departing
from the scope and spirit of the invention. Accordingly, modifications such
as those suggested above, but not limited thereto, are to be considered
within the scope of the invention.

CLAIMS

1. Data classification apparatus comprising:

an input device for receiving a plurality of training
classified examples and at least one unclassified
example;

a memory for storing the classified and unclassified
examples;

an output terminal for outputting a predicted
classification for the at least one unclassified example;
and

a processor for identifying the predicted classification of
the at least one unclassified example
wherein the processor includes:

classification allocation means for allocating potential
classifications to each unclassified example and for
generating a plurality of classification sets, each
classification set containing the plurality of training
classified examples and the at least one unclassified
example with its allocated potential classification;

assay means for determining a strangeness value valid
under the iid assumption for each classification set;

a comparative device for selecting the classification set to
which the most likely allocated potential classification for
the at least one unclassified example belongs, wherein
the predicted classification output by the output
terminal is the most likely allocated classification
according to the strangeness values assigned by the
assay means; and

a strength of prediction monitoring device for
determining a confidence value for the predicted
classification on the basis of the strangeness value
assigned by the assay means to one of the classification
sets to which the second most likely allocated potential
classification of the at least one unclassified example
belongs.

2. Data classification apparatus as claimed in claim 1,
wherein the processor further includes an example 
valuation device which determines individual 
strangeness values for each training classified example 
and the at least one unclassified example having an 
allocated potential classification. 

3. Data classification apparatus as claimed in claim 2,
wherein Lagrange multipliers are used to determine the 
individual strangeness value. 

4. Data classification apparatus as claimed in claim 2,
wherein the assay means determines a strangeness value 
for each classification set in dependence on the 
individual strangeness values of each example. 

5. Data classification apparatus comprising:

an input device for receiving a plurality of training
classified examples and at least one unclassified
example;

a memory for storing the classified and unclassified
examples;

stored programs including an example classification
program;

an output terminal for outputting a predicted
classification for the at least one unclassified example;
and

a processor controlled by the stored programs for
identifying the predicted classification of the at least one
unclassified example wherein the processor includes:

classification allocation means for allocating potential
classifications to each unclassified example and for
generating a plurality of classification sets, each
classification set containing the plurality of training
classified examples and the at least one unclassified
example with its allocated potential classification;

assay means for determining a strangeness value valid
under the iid assumption for each classification set;

a comparative device for selecting the classification set to
which the most likely allocated potential classification for
the at least one unclassified example belongs, wherein
the predicted classification output by the output
terminal is the most likely allocated potential
classification according to the strangeness values
assigned by the assay means; and

a strength of prediction monitoring device for
determining a confidence value for the predicted
classification on the basis of the strangeness value
assigned by the assay means to one of the classification
sets to which the second most likely allocated potential
classification of the at least one unclassified example
belongs.

6. A data classification method comprising:

inputting a plurality of training classified examples and
at least one unclassified example;

identifying a predicted classification of the at least one
unclassified example which includes,

allocating potential classifications to each unclassified
example;

generating a plurality of classification sets, each
classification set containing the plurality of training
classified examples and the at least one unclassified
example with its allocated potential classification;

determining a strangeness value valid under the iid
assumption for each classification set;

selecting the classification set to which the most likely
allocated potential classification for the at least one
unclassified example belongs, wherein the predicted
classification is the most likely allocated potential
classification in dependence on the strangeness values;

determining a confidence value for the predicted
classification on the basis of the strangeness value
assigned to one of the classification sets to which the
second most likely allocated potential classification for
the at least one unclassified example belongs; and

outputting the predicted classification for the at least
one unclassified example and the confidence value for
the predicted classification.

7. A data classification method as claimed in claim 6,
further including determining individual strangeness
values for each training classified example and the at
least one unclassified example having an allocated
potential classification.

8. A data classification method as claimed in any one of the
preceding claims, wherein the selected classification set 
is selected without the application of any general rules 
determined from the training set. 

9. A data carrier on which is stored a classification program
for classifying data by performing the following steps: 
generating a plurality of classification sets, each 
classification set containing a plurality of training 
classified examples and at least one unclassified example 
that has been allocated a potential classification; 
determining a strangeness value valid under the iid 
assumption for each classification set; 
selecting the classification set to which the most likely 
allocated potential classification for the at least one 
unclassified example belongs, wherein the predicted 
classification is the most likely allocated potential 
classification in dependence on the strangeness values; 
and 

determining a confidence value for the predicted 
classification on the basis of the strangeness value 
assigned to one of the classification sets to which the 
second most likely allocated potential classification for 
the at least one unclassified example belongs. 